Abstract

In data mining applications, feature selection is an essential process since it 
reduces a model's complexity. The cost of obtaining the feature values must 
be taken into consideration in many domains. In this paper, we study the 
cost-sensitive feature selection problem on numerical data with measurement 
errors, test costs and misclassification costs. The major contributions of this 
paper are four-fold. First, a new data model is built to address test costs and 
misclassification costs as well as error boundaries. Second, a covering-based 
rough set with measurement errors is constructed. Given a confidence interval, 
the neighborhood is an ellipse in a two-dimensional space, an ellipsoid in a 
three-dimensional space, and so on. Third, a new cost-sensitive feature selection 
problem is defined on this covering-based rough set. Fourth, both backtracking and 
heuristic algorithms are proposed to deal with this new problem. The algorithms 
are tested on six UCI (University of California - Irvine) data sets. Experimental 
results show that (1) the pruning techniques of the backtracking algorithm 
significantly reduce the number of operations, and (2) the heuristic algorithm 
usually obtains optimal results. This study is a step toward realistic applications 
of cost-sensitive learning. 

Keywords: cost-sensitive learning; measurement error; misclassification cost; 
test cost; feature selection. 



1. Introduction 

Feature selection is an essential process in data mining applications. The 
main aim of feature selection is to reduce the dimensionality of the feature space 
and improve the predictive accuracy of a classification algorithm [T71 135] . In the 
feature selection process, the misclassification costs and the costs of obtaining 
the feature values must be considered in many domains. Cost-sensitive feature 
selection focuses on selecting a feature subset with a minimal total cost as well 
as preserving a particular property of the decision system [351 121j . 
Test costs and misclassification costs are the two most important types of cost in 
cost-sensitive learning [23]. Test cost is the money, time, or other resources we pay 
for collecting a data item of an object [37][22]. The misclassification cost is the 
penalty we receive while deciding that an object belongs to class J when its real 
class is K [SJ [7] . Some researches have considered only misclassification costs 
PU] . or test costs [Ml [571 • However, in many applications, it is important to 
consider both types of costs together. 

Recently, a minimal cost feature selection approach considering both test 
costs and misclassification costs was proposed in [23], where a backtracking 
algorithm addresses the cost-sensitive feature selection problem with 
satisfactory performance. That work successfully addresses the minimal cost 
feature selection problem for nominal data. In real applications, however, 
data are often acquired from measurements with different measurement errors. 
Measurement errors are ubiquitous and unavoidable. 

In this paper, we propose cost-sensitive feature selection for data with mea- 
surement errors through considering the trade-off between test costs and mis- 
classification costs. The major contributions of this paper are four-fold. First, 
based on measurement errors, we build a new data model to address error bound- 
aries and test costs as well as misclassification costs. Second, we construct the 
computational model of the covering-based rough set with measurement errors. 
Given a confidence interval, the neighborhood is an ellipse in a two-dimensional 
space, an ellipsoid in a three-dimensional space, and so on. Compared with a fixed 
neighborhood, the proposed neighborhood is computed according to the values of 
attributes. Third, the cost-sensitive feature selection problem is defined on this 
new model of covering-based rough set. Fourth, both backtracking and heuristic 
algorithms are proposed to deal with this feature selection problem. 

Six open data sets from the UCI library are employed to study the perfor- 
mance and effectiveness of our algorithms. Experiments undertaken with open 
source software Coser [25] validate the performance of this algorithm. Experi- 
mental results show that (1) the backtracking algorithm can significantly reduce 
the number of searching operations; (2) the heuristic algorithm can obtain the 
optimal result for almost all test instances in less time. 

The rest of the paper is organized as follows. Section 2 presents data models 
with test costs and misclassification costs as well as measurement errors. Section 
3 describes the computational model, namely the covering-based rough set model 
with measurement errors. The minimal cost feature selection problem under the 
new model is also defined in this section. Section 4 then presents a backtracking 
algorithm and a heuristic algorithm to address this feature selection problem. 
In Section 5 we discuss the experimental settings and results. Finally, Section 
6 concludes and suggests further research trends. 

2. Data models 

Data models are presented in this section. First, we start from basic decision 
systems. Second, we introduce normal distribution errors to tests, and propose 
a decision system with measurement errors. Finally we introduce a decision 
system based on measurement errors with test costs and misclassification costs. 

2.1. Decision systems 

A decision system is defined below. 

Definition 1. A decision system (DS) is the 5-tuple: 

$$S = (U, C, d, V = \{V_a \mid a \in C \cup \{d\}\}, I = \{I_a \mid a \in C \cup \{d\}\}), \tag{1}$$

where $U$ is a universal set of objects, $C$ is a nonempty set of conditional 
attributes, and $d$ is the decision attribute. For each $a \in C \cup \{d\}$, 
$I_a : U \to V_a$. The set $V_a$ is the value set of attribute $a$, and $I_a$ is 
an information function for each attribute $a$. 

To ease processing and comparison, the values of conditional attributes are 
normalized into the range from 0 to 1. There are a number of normalization 
approaches. For simplicity, we employ the linear function 
$y = (x - min)/(max - min)$, where $x$ is the initial value, $y$ is the 
normalized value, and $max$ and $min$ are the maximal and minimal values of the 
attribute domain, respectively. 
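
As an illustration of the linear normalization above, the following is a minimal Python sketch. The function name `min_max_normalize`, the handling of constant attributes, and the sample readings are assumptions made for this example; they are not taken from the Liver data.

```python
def min_max_normalize(values):
    """Linearly rescale a list of numbers into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: the formula is undefined, so this sketch
        # maps every value to 0.0 (an assumption, not part of the paper).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw readings of a single attribute.
raw = [65, 84, 90, 103]
normalized = min_max_normalize(raw)
```

The smallest reading maps to 0 and the largest to 1, so attributes measured on different scales become directly comparable.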

Table 1 is a decision system of Bupa liver disorder (Liver for short), whose 
conditional attributes are normalized values. Here $C = \{$Mcv, Alkphos, Sgpt, 
Sgot, Gammagt, Drinks$\}$, $d = \{$Selector$\}$, and $U = \{x_1, x_2, \ldots, x_{345}\}$. 

Liver contains 7 attributes. The first 5 attributes are all blood tests which 
are thought to be sensitive to liver disorders that might arise from excessive 
alcohol consumption. The sixth attribute is the number of alcoholic drinks 
per day. Each line in Liver constitutes the record of a single male individual. 
Selector attribute is used to split data into two sets. 



Table 1: An example numerical decision system (Liver). 

Patient   Mcv    Alkphos   Sgpt   Sgot   Gammagt   Drinks   Selector
x1        0.31   0.23      0.08   0.28   0.09      0.00     y
x2        0.14   0.38      0.23   0.35   0.06      0.10     y
x3        0.25   0.40      0.40   0.14   0.17      0.20     y
x4        0.60   0.46      0.51   0.25   0.11      0.60     n
x5        0.41   0.64      0.62   0.30   0.02      0.30     n
x6        0.35   0.50      0.75   0.30   0.02      0.40     n
          0.68   0.39      0.15   0.23   0.03      0.80     n
x345      0.87   0.66      0.35   0.52   0.21      1.00     n

2.2. A decision system with measurement errors 

In real applications, data sets often contain many continuous (or numeric) 
attributes. There are a number of measurement methods with different test 
costs to obtain a numerical data item. Generally, a higher test cost is required 
to obtain data with a smaller measurement error [48]. Measurement errors 
often follow a normal distribution, which is found to be applicable over almost 
the whole of science and engineering measurement. We include normally distributed 
measurement errors in our model to expand the application scope. 

Definition 2. A decision system with measurement errors (MEDS) $S$ is 
the 6-tuple: 

$$S = (U, C, d, V, I, n), \tag{2}$$

where $U$, $C$, $d$, $V$, and $I$ have the same meanings as in Definition 1, 
$n : C \to \mathbb{R}^+ \cup \{0\}$ is the maximal measurement error function, and 
$\pm n(a)$ is the error boundary of attribute $a$. 

Given $x_i \in U$, the error boundary of attribute $a$ is given by 

$$n(a) = \frac{\Delta \sum_{i=1}^{m} a(x_i)}{m}, \tag{3}$$

where $m = |U|$ and the regulator factor $\Delta \in [0, 1]$ can adjust the error boundary. 

Recently, the concept of neighborhood (see, e.g., [13 [H]) has been applied 
to define different types of covering-based rough set [47l ESj EH [1^. A 
neighborhood based on a fixed error range is defined in [24]. Although similar 
to that neighborhood, ours is essentially different: it takes into account both 
the distribution of the data error and the confidence interval. The neighborhood 
boundaries for different attributes of the same database differ from one 
another. An example neighborhood boundary vector is listed in Table 2. 

Table 2: An example neighborhood boundary vector. 

a      Mcv     Alkphos   Sgpt    Sgot    Gammagt   Drinks
n(a)   0.069   0.087     0.086   0.036   0.026     0.017




2.3. A decision system based on measurement errors with test costs and mis- 
classification costs 

In many applications, the test cost must be taken into account [42]. Test 
cost is the money, time, or other resources we pay for collecting a data item of 
an object [371 [HI 1311 HO] . In addition to the test costs, it is also necessary 
to consider misclassification costs. A decision cannot be made if the misclassi- 
fication costs are unreasonable [42]. More recently, researchers have begun to 
consider both test and misclassification costs [23l HI [10] . 
Now we take into account both test and misclassification costs as well as 
normal distribution measurement errors. A decision system based on measurement 
errors with test costs and misclassification costs is defined as follows. 

Definition 3. A decision system based on measurement errors with test costs 
and misclassification costs (MEDS-TM) $S$ is the 8-tuple: 

$$S = (U, C, d, V, I, n, tc, mc), \tag{4}$$

where $U$, $C$, $d$, $V$, $I$ and $n$ have the same meanings as in Definition 2, 
$tc : C \to \mathbb{R}^+ \cup \{0\}$ is the test cost function, and 
$mc : k \times k \to \mathbb{R}^+ \cup \{0\}$ is the misclassification cost 
function, where $k = |I_d|$. 

Here we consider only the sequence-independent test-cost-sensitive decision 
system. There are a number of test-cost-sensitive decision systems; a hierarchy 
of decision systems consisting of six models was proposed in [22]. For any 
$B \subseteq C$, the test cost function $tc$ is given by 
$tc(B) = \sum_{a \in B} tc(a)$. 

The test cost function can be stored in a vector. An example of a test cost 
vector is listed in Table 3. 



Table 3: An example of test cost vector. 

a       Mcv   Alkphos   Sgpt   Sgot   Gammagt   Drinks
tc(a)   $26   $17       $34    $45    $38       $5




The misclassification cost [T31 [351 H] is the penalty we receive while deciding 
that an object belongs to class i when its real class is j [S]. The misclassification 
cost function mc is defined as follows: 

1. $mc : k \times k \to \mathbb{R}^+ \cup \{0\}$ is the misclassification cost 
function, which can be represented by a matrix $MC = \{mc_{k \times k}\}$, where 
$k = |I_d|$. 

2. $mc[m, n]$ is the cost of misclassifying an example from "class m" to "class 
n". 

3. $mc[m, m] = 0$. 

The following example gives us an intuitive understanding of the decision 
system based on measurement errors with test costs and misclassification costs. 

Example 4. Table 1 is a Liver decision system. Tables 2 and 3 are the error 
boundary vector and the test cost vector of the Liver decision system, 
respectively. The misclassification cost matrix is 

$$mc = \begin{pmatrix} 0 & 2000 \\ 200 & 0 \end{pmatrix}. \tag{5}$$

That is, the test costs of Mcv, Alkphos, Sgpt, Sgot, Gammagt, and Drinks are 
$26, $17, $34, $45, $38, and $5 respectively. In the Liver data set, the selector 
field is used to split data into two sets. Here, a false negative prediction (FN), 
i.e., failing to detect liver disorders, may well have fatal consequences, whereas 
a false positive prediction (FP), i.e., diagnosing liver disorders for a patient 
that does not actually have them, may be less serious. Therefore, a higher 
penalty of $2000 is paid for an FN prediction and $200 is paid for an FP prediction. 



Obviously, if $tc$ and $mc$ are not considered, a MEDS-TM degrades to a 
decision system with measurement errors (MEDS) (see, e.g., [IS]). Therefore the 
MEDS-TM is a generalization of the MEDS. 

3. Covering-based rough set with measurement errors 

As a technique to deal with granularity in information systems, rough set 
theory was proposed by Pawlak [28] . Since then we have witnessed a system- 
atic, world-wide growth of interest in rough set theory [21 21 [SI HH [321 1131 Ull 

[SH [551 [SS] and its applications [H [Mj . Recently, there has been growing 
interest in covering-based rough set. In this section, we introduce normal dis- 
tribution measurement errors to covering-based rough set. The new model is 
called covering-based rough set with measurement errors. Then we define a new 
cost-sensitive feature selection problem on this covering-based rough set. 

3.1. Covering-based rough set with measurement errors 

The covering-based rough set with measurement errors is a natural extension 
of the classical rough set. If all attributes are error free, the covering-based rough 
set model degenerates to the classical one. With the definition of the MEDS, a 
new neighborhood is defined as follows. 

Definition 5. Let $S = (U, C, d, V, I, n)$ be a decision system with measurement 
errors. Given $B \subseteq C$ and $x_i \in U$, the neighborhood of $x_i$ with 
reference to measurement errors on the feature set $B$ is defined as 

$$n_B(x_i) = \{x \in U \mid \forall a \in B, |a(x) - a(x_i)| \le 2n(a)\}. \tag{6}$$

That is, the measurement error of attribute $a$ lies in $[-n(a), +n(a)]$. 
According to Definition 5, the neighborhood $n_B(x_i)$ is the intersection of 
multiple basic neighborhoods. Therefore, we obtain 

$$n_B(x_i) = \bigcap_{a \in B} n_{\{a\}}(x_i). \tag{7}$$

Although similar, the neighborhood defined in [24] is essentially different 
from ours in two ways. First, [24] uses a fixed neighborhood boundary for 
different data sets, whereas the boundaries of neighborhoods in our model are 
computed according to the values of attributes. Second, [24] considers the 
uniform distribution, whereas we introduce the normal distribution to our model. 
As mentioned earlier, the normal distribution is found to be applicable over 
almost the whole of science measurement. 

Normal distribution is a plausible distribution for measurement errors. In 
statistics, the "3-sigma" rule states that about 99.73% (95.45%) of measurement 
data fall within three (two) standard deviations of the mean [1]. We introduce 
this rule to our model and present a new neighborhood considering both the 
error distribution and the confidence interval. The proportion of small 
measurement errors is higher than that of large ones. Any measured value that 
deviates from the mean by more than three standard deviations should be 
discarded. Therefore, measurement errors with a difference of no more than 
$3\sigma$ ($2\sigma$) should be viewed as a granule. In view of this, we 
introduce the relationship between the error boundary and the standard 
deviation in the following proposition. 

Proposition 6. Let the error boundary $n(a) = 3\sigma$ and $Pr$ be the confidence 
level. We have about $Pr = 99.73\%$ of cases within $n(a) = \pm 3\sigma$. 

According to Proposition 6, we have about $Pr = 99.73\%$ ($Pr = 95.45\%$) of 
cases within $n(a) = \pm 3\sigma$ ($n(a) = \pm 2\sigma$). The two-dimensional 
block is depicted in Figure 1. About 99.73% of $a_1(x)$ and $a_2(x)$ are within 
$\pm n(a_1)$ and $\pm n(a_2)$, respectively. The shape of the neighborhood is an 
ellipse in a two-dimensional space, an ellipsoid in a three-dimensional space, 
and so on. 
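
The confidence levels quoted for the two-sigma and three-sigma boundaries can be checked numerically. The sketch below uses the standard identity $P(|X - \mu| \le k\sigma) = \mathrm{erf}(k/\sqrt{2})$ for a normal variable; the function name `coverage` is an assumption of this example.

```python
import math


def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


two_sigma = coverage(2)    # confidence level for n(a) = 2 * sigma
three_sigma = coverage(3)  # confidence level for n(a) = 3 * sigma
```

Rounded to two decimals as percentages, the two values agree with the 95.45% and 99.73% figures used in the text.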







Figure 1: Two-dimensional neighborhood with measurement errors. 

Figure 2 shows two-dimensional neighborhoods based on different error 
boundaries. For example, if $n(a) = 2\sigma$, we have about $Pr = 95.45\%$ of 
cases within $\pm n(a)$. According to Definition 5, every item belongs to its own 
neighborhood. This is formally given by the following theorem. 



Figure 2: Two-dimensional neighborhood with measurement errors based on different UCL. 
Theorem 7. Let $S = (U, C, d, V, I, n)$ be a decision system with measurement 
errors and $B \subseteq C$. The set $\{n_B(x_i) \mid x_i \in U\}$ is a covering of $U$. 

Proof. For any $x \in U$ and any $a \in B$, $|a(x) - a(x)| = 0 \le 2n(a)$, so 
$x \in n_B(x)$. 

Therefore, for all $x \in U$, $n_B(x) \neq \emptyset$, and for any $B \subseteq C$, 
$\bigcup_{x \in U} n_B(x) = U$. 

Hence the set $\{n_B(x_i) \mid x_i \in U\}$ is a covering of $U$. This completes the proof. 

Now we discuss the lower and upper approximations as well as the boundary 
region of rough set in the new model. 

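Definition 8. Let $S = (U, C, d, V, I, n)$ be a decision system with 
measurement errors, and $N_B$ be a neighborhood relation on $U$, where 
$B \subseteq C$. We call $\langle U, N_B \rangle$ a neighborhood approximation 
space. For arbitrary $X \subseteq U$, the lower approximation and the upper 
approximation of $X$ in $\langle U, N_B \rangle$ are defined as 

$$\underline{N_B}(X) = \{x_i \mid x_i \in U \wedge n_B(x_i) \subseteq X\}; \tag{8}$$

$$\overline{N_B}(X) = \{x_i \mid x_i \in U \wedge n_B(x_i) \cap X \neq \emptyset\}. \tag{9}$$

The positive region of $\{d\}$ concerning $B \subseteq C$ is defined as 
$POS_B(\{d\}) = \bigcup_{X \in U/\{d\}} \underline{N_B}(X)$. 

Definition 9. Let $S = (U, C, d, V, I, n)$ be a decision system with measurement 
errors. For all $X \subseteq U$, $\underline{N_B}(X) \subseteq X \subseteq \overline{N_B}(X)$. 
The boundary region of $X$ in $\langle U, N_B \rangle$ is defined as 

$$BN_B(X) = \overline{N_B}(X) - \underline{N_B}(X). \tag{10}$$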

Generally, a covering is produced by a neighborhood boundary. The incon- 
sistent object in a neighborhood is defined as follows. 

Definition 10. [48] Let $S = (U, C, d, V, I, n)$ be a decision system with 
measurement errors, $B \subseteq C$ and $x, y \in U$. In $n_B(x)$, any 
$y \in n_B(x)$ is called an inconsistent object if $d(y) \neq d(x)$. The set of 
inconsistent objects in $n_B(x)$ is 

$$ic_B(x) = \{y \in n_B(x) \mid d(y) \neq d(x)\}. \tag{11}$$

The number of inconsistent objects is $|ic_B(x)|$. 

Using a specific example, we explain the lower approximations, the upper 
approximations, the boundary regions and the inconsistent objects of the neigh- 
borhood. 

Example 11. A decision system with neighborhood boundaries is given in Tables 
4 and 5. Table 4 is a sub-table of Table 1. Let $U = \{x_1, x_2, \ldots, x_6\}$, 
$C = \{a_1, a_2, a_3\}$, and $D = \{d\} = \{$Selector$\}$, where $a_1$ = Mcv, 
$a_2$ = Alkphos, and $a_3$ = Sgpt. $n_B(x)$ is listed in Table 6, where $B$ takes 
values listed as column headers, and $x$ takes values listed in each row. 
According to Definition 10, the set of inconsistent objects in $n_{\{a_1\}}(x_1)$ 
is $ic_{\{a_1\}}(x_1) = \{x_5, x_6\}$. 
Table 4: A sub-table of the Liver decision system. 

Patient   a1     a2     a3     d
x1        0.31   0.23   0.08   y
x2        0.14   0.38   0.23   y
x3        0.25   0.40   0.40   y
x4        0.60   0.46   0.51   n
x5        0.41   0.64   0.62   n
x6        0.35   0.50   0.75   n

Table 5: An example adaptive neighborhood boundary vector. 

a                         a1       a2       a3
Neighborhood boundaries   ±0.069   ±0.087   ±0.086




In addition, $U$ is divided into a set of equivalence classes by $\{d\}$: 
$U/\{d\} = \{\{x_1, x_2, x_3\}, \{x_4, x_5, x_6\}\}$. Let $X_1 = \{x_1, x_2, x_3\}$ 
and $X_2 = \{x_4, x_5, x_6\}$. $\underline{N_B}(X)$ and $\overline{N_B}(X)$ are 
listed in the first part and the second part of Table 7, respectively. Here $B$ 
takes values listed as column headers, and $X$ takes values listed in each row. 

The positive regions and the boundary regions of $U$ on different test sets can 
be computed from Table 7: 

1. $POS_{\{a_1\}}(\{d\}) = \{x_2, x_4\}$, $BN_{\{a_1\}}(\{d\}) = \{x_1, x_3, x_5, x_6\}$; 

2. $POS_{\{a_1, a_2\}}(\{d\}) = \{x_1, x_2, x_4, x_5\}$, $BN_{\{a_1, a_2\}}(\{d\}) = \{x_3, x_6\}$; 

3. $POS_{\{a_1, a_3\}}(\{d\}) = \{x_1, x_2, x_3, x_4, x_5, x_6\}$, $BN_{\{a_1, a_3\}}(\{d\}) = \emptyset$; 

4. $\{a_1, a_3\}$ has the same approximating power as $C$. 





Table 6: The neighborhood of objects on different test sets. 

x    {a1}            {a1,a2}         {a1,a3}   {a1,a2,a3}
x1   {x1,x3,x5,x6}   {x1,x3}         {x1}      {x1}
x2   {x2,x3}         {x2,x3}         {x2,x3}   {x2,x3}
x3   {x1,x2,x3,x6}   {x1,x2,x3,x6}   {x2,x3}   {x2,x3}
x4   {x4}            {x4}            {x4}      {x4}
x5   {x1,x5,x6}      {x5,x6}         {x5,x6}   {x5,x6}
x6   {x1,x3,x5,x6}   {x3,x5,x6}      {x5,x6}   {x5,x6}

Table 7: Approximations of object subsets on different test sets. 

                      X    {a1}               {a1,a2}         {a1,a3}      {a1,a2,a3}
Lower approximation   X1   {x2}               {x1,x2}         {x1,x2,x3}   {x1,x2,x3}
                      X2   {x4}               {x4,x5}         {x4,x5,x6}   {x4,x5,x6}
Upper approximation   X1   {x1,x2,x3,x5,x6}   {x1,x2,x3,x6}   {x1,x2,x3}   {x1,x2,x3}
                      X2   {x1,x3,x4,x5,x6}   {x3,x4,x5,x6}   {x4,x5,x6}   {x4,x5,x6}

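
The neighborhoods, approximations and positive regions of Example 11 can be reproduced with a short Python sketch of Definitions 5 and 8, applied to the data of Table 4 and the boundaries of Table 5. The names `DATA`, `N`, `neighborhood`, `lower`, `upper` and `positive_region`, and the encoding of attributes as indices 0 to 2, are assumptions of this illustration rather than part of the original implementation.

```python
# Values of (a1, a2, a3) and decision d from Table 4; boundaries n(a) from Table 5.
DATA = {
    "x1": ((0.31, 0.23, 0.08), "y"),
    "x2": ((0.14, 0.38, 0.23), "y"),
    "x3": ((0.25, 0.40, 0.40), "y"),
    "x4": ((0.60, 0.46, 0.51), "n"),
    "x5": ((0.41, 0.64, 0.62), "n"),
    "x6": ((0.35, 0.50, 0.75), "n"),
}
N = (0.069, 0.087, 0.086)


def neighborhood(x, B):
    """n_B(x): objects whose value differs from x by at most 2 n(a) on every a in B."""
    vx = DATA[x][0]
    return {y for y, (vy, _) in DATA.items()
            if all(abs(vy[a] - vx[a]) <= 2 * N[a] for a in B)}


def lower(X, B):
    """Lower approximation: objects whose neighborhood is contained in X."""
    return {x for x in DATA if neighborhood(x, B) <= X}


def upper(X, B):
    """Upper approximation: objects whose neighborhood intersects X."""
    return {x for x in DATA if neighborhood(x, B) & X}


def positive_region(B):
    """Union of the lower approximations of all decision classes."""
    classes = {}
    for x, (_, d) in DATA.items():
        classes.setdefault(d, set()).add(x)
    pos = set()
    for X in classes.values():
        pos |= lower(X, B)
    return pos


A1, A1A2, A1A3 = (0,), (0, 1), (0, 2)  # feature subsets as attribute indices
```

For instance, `neighborhood("x1", A1)` gives the first entry of Table 6, and `positive_region(A1A3)` covers all six objects, which is why $\{a_1, a_3\}$ has the same approximating power as $C$.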



3.2. Minimal cost feature selection problem 

In this work, we focus on cost-sensitive feature selection based on test costs 
and misclassification costs. Unlike reduction problems, we do not require any 
particular property of the decision system to be preserved. The objective of 
feature selection is to minimize the average total cost through considering a 
trade-off between test costs and misclassification costs. Cost-sensitive feature 
selection problem is called the feature selection with minimal average total cost 
(FSMC) problem. 

Problem 12. The FSMC problem. 
Input: $S = (U, C, d, V, I, n, tc, mc)$; 
Output: $R \subseteq C$; 
Optimization objective: minimize the average total cost (ATC). 

The FSMC problem is a generalization of classical minimal reduction prob- 
lem. On the one hand, several factors should be considered such as the test costs 
and misclassification costs as well as normal distribution measurement errors. 
These factors are all intrinsic to data in real applications. On the other hand, 
the minimal average total cost is the optimization objective through considering 
the trade-off between the two kinds of costs. Compared with accuracy, the 
average total cost is a more general metric in data mining applications [50]. The 
following is a four-step process to compute the average total cost. 

1. Let $B$ be a selected feature set. For each $x \in U$, we compute the 
neighborhood $n_B(x)$. 

2. Let $U' = n_B(x)$, and let $d(x)$ be the decision value of object $x$. Let 
$|U'_m|$ and $|U'_n|$ be the numbers of m-class and n-class objects in $U'$, 
respectively, where $m, n \in I_d$. In order to minimize the misclassification 
cost of the set $U'$, we assign one class $d'(x)$ to all objects in $U'$: 

$$mc(U', B) = \min(mc[m, n] \times |U'_m|, mc[n, m] \times |U'_n|). \tag{12}$$

For any $x \in U'$, the assigned class is 

$$d'(x) = \begin{cases} \text{n-class} & \text{if } mc(U', B) = mc[m, n] \times |U'_m|, \\ \text{m-class} & \text{if } mc(U', B) = mc[n, m] \times |U'_n|, \end{cases} \tag{13}$$

where $mc[m, n]$ is the cost of classifying an object of the m-class to the n-class. 

3. An object $x$ may receive different assigned classes from the different 
neighborhoods it belongs to. Its decision value $d'(x)$ is the class that is 
assigned to it most often. The misclassification cost of the object $x$ is 

$$mc^*(x) = \begin{cases} 0 & \text{if } d(x) = d'(x), \\ mc[m, n] & \text{if } d(x) = m \text{ and } d'(x) = n, \\ mc[n, m] & \text{if } d(x) = n \text{ and } d'(x) = m. \end{cases} \tag{14}$$

Therefore, we compute the average misclassification cost (AMC) as follows: 

$$\overline{mc}(U, B) = \frac{\sum_{x \in U} mc^*(x)}{|U|}. \tag{15}$$

4. The average total cost (ATC) is given by 

$$ATC(U, B) = tc(B) + \overline{mc}(U, B). \tag{16}$$

The main aim of feature selection is to determine a minimal feature subset 
from a problem domain while retaining a suitably high accuracy in representing 
the original features [5]. In this context, rather than selecting a minimal feature 
subset, we choose a feature subset in order to minimize the average total cost. 
The minimal average total cost is given by 

$$ATC(U, B) = \min\{ATC(U, B') \mid B' \subseteq C\}. \tag{17}$$

The following example gives an intuitive understanding. 

Example 13. A decision system with neighborhood boundaries is given by Tables 
4 and 5. Let $C = \{a_1, a_2, a_3\}$, $B = \{a_1, a_2\}$, and $D = \{d\}$. Let 
$tc = [8, 23, 19]$, and $mc = \begin{pmatrix} 0 & 180 \\ 60 & 0 \end{pmatrix}$. 

Step 1. $n_B(x_i)$ is the neighborhood of $x_i \in U$, which is listed in Table 8. 
If $x_j \in n_B(x_i)$, the value at the i-th row and j-th column is set to 1; 
otherwise, it is set to 0. 



Table 8: The neighborhood of objects on B = {a1,a2}. 

U    x1   x2   x3   x4   x5   x6
x1   1    0    1    0    0    0
x2   0    1    1    0    0    0
x3   1    1    1    0    0    1
x4   0    0    0    1    0    0
x5   0    0    0    0    1    1
x6   0    0    1    0    1    1


Step 2. Since $n_B(x_i) \subseteq POS_B(\{d\})$, $mc(n_B(x_i), B) = 0$, where 
$i = 1, 2, 4, 5$. The set $n_B(x_3) = \{x_1, x_2, x_3, x_6\}$ has two kinds of 
classes, which should be adjusted to one class. Since 
$mc(n_B(x_3), B) = \min(60 \times 1, 180 \times 3) = 60$, for any $x \in U'$, 
$d'(x) = $ "y". In the same way, in order to minimize the cost 
$mc(n_B(x_6), B) = \min(60 \times 2, 180 \times 1) = 120$, we adjust all classes 
of elements in $n_B(x_6)$ to "y". We thus obtain the new class of each test, and 
count the number of different classes of each test, as listed in Table 9. 



Table 9: The number of different classes. 

d    x1   x2   x3   x4   x5   x6
y    2    2    4    0    1    2
n    0    0    0    1    1    1



Step 3. From Table 9, we select the class with the maximal count of $d'(x_i)$ as 
the class value of $x_i$. The original decision attribute value $d(x)$ and 
$d'(x)$ are listed in Table 10. From this table, we know $d(x_5) \neq d'(x_5)$ and 
$d(x_6) \neq d'(x_6)$. Therefore, the average misclassification cost is 
$\overline{mc}(U, B) = (60 + 60)/6 = 20$. 



Table 10: The difference of decision attributes. 

U       x1   x2   x3   x4   x5   x6
d'(x)   y    y    y    n    y    y
d(x)    y    y    y    n    n    n



Step 4. The average total cost is $ATC(U, B) = (8 + 23) + 20 = 51$. 
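
The four steps of Example 13 can be traced with the following Python sketch. It is a hedged illustration, not the authors' implementation: the block class is chosen by minimizing the misclassification cost of the whole neighborhood (which reduces to Equation (12) for two classes), and the tie between "y" and "n" votes for $x_5$ is broken in favor of the class that is cheaper to assign wrongly. That tie-breaking rule is an assumption chosen because it reproduces Table 10. All names (`DATA`, `N`, `TC`, `MC`, `average_total_cost`) are hypothetical.

```python
from collections import Counter

# Table 4 data, Table 5 boundaries, and the costs of Example 13.
DATA = {
    "x1": ((0.31, 0.23, 0.08), "y"), "x2": ((0.14, 0.38, 0.23), "y"),
    "x3": ((0.25, 0.40, 0.40), "y"), "x4": ((0.60, 0.46, 0.51), "n"),
    "x5": ((0.41, 0.64, 0.62), "n"), "x6": ((0.35, 0.50, 0.75), "n"),
}
N = (0.069, 0.087, 0.086)
TC = (8, 23, 19)
CLASSES = ("y", "n")
MC = {("y", "n"): 180, ("n", "y"): 60}  # MC[(real class, assigned class)]


def neighborhood(x, B):
    """Step 1: n_B(x) according to Definition 5."""
    vx = DATA[x][0]
    return [y for y, (vy, _) in DATA.items()
            if all(abs(vy[a] - vx[a]) <= 2 * N[a] for a in B)]


def block_class(block):
    """Step 2: the single class that makes the block cheapest to mislabel."""
    counts = Counter(DATA[y][1] for y in block)

    def cost(c):
        return sum(MC[(m, c)] * k for m, k in counts.items() if m != c)

    return min(CLASSES, key=cost)


def average_total_cost(B):
    # Each neighborhood votes its block class for every object it contains.
    votes = {x: Counter() for x in DATA}
    for x in DATA:
        block = neighborhood(x, B)
        c = block_class(block)
        for y in block:
            votes[y][c] += 1
    # Step 3: majority vote; ties go to the class with the cheaper error
    # (an assumption of this sketch).
    total = 0
    for x, (_, d) in DATA.items():
        best = max(votes[x].values())
        tied = [c for c, v in votes[x].items() if v == best]
        d_new = min(tied, key=lambda c: max(MC[(m, c)] for m in CLASSES if m != c))
        if d_new != d:
            total += MC[(d, d_new)]
    amc = total / len(DATA)
    # Step 4: test cost of B plus average misclassification cost.
    return sum(TC[a] for a in B) + amc


atc_a1a2 = average_total_cost((0, 1))  # B = {a1, a2}
atc_a1a3 = average_total_cost((0, 2))  # B = {a1, a3}
```

With $B = \{a_1, a_2\}$ the sketch returns 51, matching Step 4. With $B = \{a_1, a_3\}$ every neighborhood is pure, so the average misclassification cost is 0 and the total reduces to the test cost 8 + 19 = 27.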

In order to search for a minimal cost feature subset, we define a problem 
to deal with this issue. In the context of MEDS-TM, this problem is called the 
cost-sensitive feature selection problem, or the minimal cost feature 
selection (FSMC) problem. Compared with the minimal test cost reduct (MTR) 
problem (see, e.g., [21][35]), the FSMC problem should not only consider the 
test costs, but also take the misclassification costs into account. When the 
misclassification costs are too large compared with the test costs, the total 
test cost equals the total cost. In this case, the FSMC problem coincides with 
the MTR problem. 

4. Algorithms 

We propose a $\delta$-weighted heuristic algorithm to address the minimal cost 
feature selection problem. In order to evaluate the performance of a heuristic 
algorithm, an exhaustive algorithm is also needed. Here the exhaustive search is 
implemented as a backtracking algorithm, which explores every possible feature 
subset to find an optimal result. In this section, we propose both exhaustive 
and heuristic algorithms for this new feature selection problem. 
4.1. The backtracking feature selection algorithm 

We propose an exhaustive algorithm based on backtracking in this subsection. 
The backtracking algorithm can reduce the search space significantly 
through three pruning techniques. The backtracking feature selection algorithm 
is illustrated in Algorithm 1. In order to invoke this backtracking algorithm, 
several global variables should be explicitly initialized as follows: 

1. $R = \emptyset$ is a feature subset with minimal average total cost; 

2. $cmc = \overline{mc}(U, R)$ is the currently minimal average total cost; 

3. backtracking($R$, 0). 

A feature subset with the minimal ATC will be stored in $R$ at the end of the 
algorithm execution. Generally, the search space of the feature selection 
algorithm is $2^{|C|}$. In order to deal with this issue, there are a number of 
algorithms such as particle swarm optimization algorithms [41], genetic 
algorithms [14], and backtracking algorithms [33] in real applications. 

In Algorithm 1, three pruning techniques are employed to reduce the search 
space in feature selection. First, Line 1 indicates that the variable $i$ starts 
from $l$ instead of 0. Whenever we move forward (see Line 13), the lower bound is 
increased. The second pruning technique is shown in Lines 3 through 5. In real 
applications, the misclassification costs are non-negative. Hence a feature 
subset $B$ is discarded if the test cost of $B$ is larger than the current 
minimal average total cost ($cmc$). This technique can prune most branches. 
Finally, Lines 6 through 8 indicate that if the new feature subset produces a 
higher total cost even though the misclassification cost decreases, the current 
branch will never produce the feature subset with minimal total cost. 



Algorithm 1 A backtracking algorithm to the FSMC problem. 

Input: (U, C, d, {V_a}, {I_a}, n, tc, mc), selected tests R, current level test 
index lower bound l 
Output: A set of features R with ATC and cmc; they are global variables 
Method: backtracking 

1: for (i = l; i < |C|; i++) do 
2:   B = R ∪ {a_i} 
3:   if (tc(B) > cmc) then 
4:     continue; // Pruning for too expensive test cost 
5:   end if 
6:   if ((ATC(U, B) ≥ ATC(U, R)) and (mc(B) < mc(R))) then 
7:     continue; // Pruning for non-decreasing total cost and decreasing 
       misclassification cost 
8:   end if 
9:   if (ATC(U, B) < cmc) then 
10:    cmc = ATC(U, B); // Update the minimal total cost 
11:    R = B; // Update the set of features with minimal total cost 
12:  end if 
13:  backtracking(B, i + 1); 
14: end for 
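
The search scheme of Algorithm 1 can be sketched in Python as follows. This is a simplified variant under stated assumptions, not the authors' code: it keeps the index lower bound (Line 1) and the test-cost pruning (Lines 3 through 5), which is safe because misclassification costs are non-negative, but it omits the second pruning rule of Lines 6 through 8. The average misclassification cost is passed in as a callable `amc`, and `toy_amc` is a purely hypothetical stand-in used only to exercise the search.

```python
from itertools import combinations


def backtrack_fsmc(tc, amc):
    """Return (best subset, minimal ATC), where ATC(B) = sum of tc over B + amc(B)."""
    n = len(tc)
    best = {"subset": frozenset(), "cost": amc(frozenset())}

    def search(selected, test_cost_so_far, lower):
        for i in range(lower, n):
            test_cost = test_cost_so_far + tc[i]
            if test_cost > best["cost"]:
                # The test cost alone already exceeds the best total cost,
                # so neither this subset nor its extensions can win.
                continue
            candidate = selected | {i}
            total = test_cost + amc(candidate)
            if total < best["cost"]:
                best["cost"] = total
                best["subset"] = candidate
            search(candidate, test_cost, i + 1)

    search(frozenset(), 0, 0)
    return best["subset"], best["cost"]


def exhaustive_minimum(tc, amc):
    """Brute-force reference: minimal ATC over all subsets, without pruning."""
    n = len(tc)
    return min(sum(tc[i] for i in s) + amc(frozenset(s))
               for r in range(n + 1) for s in combinations(range(n), r))


def toy_amc(B):
    # Hypothetical cost model: misclassification cost shrinks as features are added.
    return 60 / (1 + len(B))


tc = [8, 23, 19]  # test costs as in Example 13
subset, cost = backtrack_fsmc(tc, toy_amc)
```

Comparing `cost` with `exhaustive_minimum(tc, toy_amc)` is a quick way to confirm that the pruning does not discard the optimum.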
4.2. The $\delta$-weighted heuristic feature selection algorithm 

In order to deal with the minimal cost feature selection problem, we design a 
$\delta$-weighted heuristic feature selection algorithm. The algorithm framework 
is listed in Algorithm 2 and contains two main steps. First, the algorithm adds 
the current best feature $a$ to $B$ according to the heuristic function 
$f(B, a_i, c(a_i))$ until $B$ becomes a super-reduct. Then, it deletes features 
from $B$ as long as doing so lowers the total cost, so that $B$ has the current 
minimal total cost. In Algorithm 2, lines 5 and 7 contain the key code of the 
addition. Lines 10 to 14 show the steps of deletion. 

Algorithm 2 An addition-deletion cost-sensitive feature selection algorithm. 

Input: (U, C, d, {V_a}, {I_a}, n, tc, mc) 
Output: A feature subset with minimal total cost 
Method: 

1: B = ∅; 
   // Addition 
2: CA = C; 
3: while (POS_B(D) ≠ POS_C(D)) do 
4:   for each a ∈ CA do 
5:     Compute f(B, a, c); 
6:   end for 
7:   Select a' with the maximal f(B, a', c); 
8:   B = B ∪ {a'}; CA = CA − {a'}; 
9: end while 
   // Deletion 
10: while (ATC(U, B) > ATC(U, B − {a})) do 
11:   for each a ∈ B do 
12:     Compute ATC(U, B − {a}); 
13:   end for 
14:   Select a' with the minimal ATC(U, B − {a'}); 
15:   B = B − {a'}; 
16: end while 
17: return B; 



According to Definition 10, the number of inconsistent objects $|ic_B(x)|$ in 
the neighborhood $n_B(x)$ is useful in evaluating the quality of a neighborhood 
block. Now we introduce the following concepts. 

Definition 14. [24] Let $S = (U, C, D, V, I, n)$ be a decision system with 
measurement errors, $B \subseteq C$ and $x \in U$. The total number of 
inconsistent objects with respect to $U$ is 

$$nc_B(S) = \sum_{x \in U} |ic_B(x)|, \tag{18}$$

and that with respect to the positive region is 

$$pc_B(S) = \sum_{x \in POS_C(D)} |ic_B(x)|. \tag{19}$$
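
As a concrete check of Definitions 10 and 14, the counts $nc_B(S)$ and $pc_B(S)$ can be computed for the sub-table of Table 4 with the boundaries of Table 5. The sketch below is illustrative only; the names `inconsistent`, `nc` and `pc` are assumptions, and $POS_C(D)$ is obtained as the set of objects whose neighborhood on $C$ contains no inconsistent object, which is equivalent to the union of lower approximations because every object belongs to its own neighborhood.

```python
# Values of (a1, a2, a3) and decision d from Table 4; boundaries n(a) from Table 5.
DATA = {
    "x1": ((0.31, 0.23, 0.08), "y"), "x2": ((0.14, 0.38, 0.23), "y"),
    "x3": ((0.25, 0.40, 0.40), "y"), "x4": ((0.60, 0.46, 0.51), "n"),
    "x5": ((0.41, 0.64, 0.62), "n"), "x6": ((0.35, 0.50, 0.75), "n"),
}
N = (0.069, 0.087, 0.086)
C = (0, 1, 2)  # all conditional attributes, as indices


def neighborhood(x, B):
    """n_B(x) according to Definition 5."""
    vx = DATA[x][0]
    return {y for y, (vy, _) in DATA.items()
            if all(abs(vy[a] - vx[a]) <= 2 * N[a] for a in B)}


def inconsistent(x, B):
    """ic_B(x): neighbors of x whose decision differs from d(x)."""
    return {y for y in neighborhood(x, B) if DATA[y][1] != DATA[x][1]}


def nc(B):
    """nc_B(S): inconsistent-object count summed over all of U."""
    return sum(len(inconsistent(x, B)) for x in DATA)


def pc(B):
    """pc_B(S): the same count restricted to objects in POS_C(D)."""
    pos_c = {x for x in DATA if not inconsistent(x, C)}
    return sum(len(inconsistent(x, B)) for x in pos_c)


pc_a1 = pc((0,))      # B = {a1}
pc_a1a2 = pc((0, 1))  # B = {a1, a2}
pc_a1a3 = pc((0, 2))  # B = {a1, a3}
```

On this data `pc_a1a3` is 0 while `pc_a1` and `pc_a1a2` are positive, which is consistent with $\{a_1, a_3\}$ having the same approximating power as $C$ in Example 11.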
According to Definition 14, we know that $B$ is a super-reduct if and only if 
$pc_B(S) = 0$. Now we propose a $\delta$-weighted heuristic information function: 

$$f(B, a_i, c(a_i)) = (pc_B(S) - pc_{B \cup \{a_i\}}(S))(1 + \ldots), \tag{20}$$

where $c(a_i)$ is the test cost of the attribute $a_i$, and $\delta \ge 0$ is a 
user-specified parameter. In this heuristic information function, the attributes 
with lower cost have bigger significance. We can adjust the significance of test 
cost through different $\delta$ settings. If $\delta = 0$, test costs are 
essentially not considered. 



5. Experiments 

In this section, we try to answer the following questions by experimentation. 
The first concerns the backtracking algorithm, the second concerns the 
heuristic algorithm, and the third concerns the influence of the cost settings. 

1. Is the backtracking algorithm efficient? 

2. Is the heuristic algorithm appropriate for the minimal cost feature 
selection problem? 

3. How does the minimal total cost change for different misclassification cost 
settings? 



5.1. Data generation 

Experiments are carried out on six standard UCI data sets, as listed in Table 
11. Most data sets from the UCI library [2] have no intrinsic measurement errors, 
test costs or misclassification costs. To study the performance of the feature 
selection algorithms, we generate these data for the experiments. 



Table 11: Database information. 

No.  Name    Domain    |U|  |C|  D = {d} 
1    Liver   clinic    345  6    selector 
2    Wdbc    clinic    569  30   diagnosis 
3    Wpbc    clinic    198  33   outcome 
4    Diab    clinic    768  8    class 
5    Iono    physics   351  34   class 
6    Credit  commerce  690  15   class 



Step 1. Each data set should contain exactly one decision attribute and 
have no missing values. To make the data easier to handle, data items are 
normalized into the range from 0 to 1. 

Step 2. We produce $n(a)$ for each original test according to Equation 
(3). The value of $n(a)$ is computed from the values in the databases, without 
any subjective input. 

Table 12: Generated neighborhood boundaries for different databases. 

Data set  Minimal  Maximal  Average 
Liver     0.022    0.130    ±0.058 
Wdbc      0.012    0.080    ±0.046 
Wpbc      0.022    0.112    ±0.062 
Diab      0.018    0.118    ±0.062 
Iono      0.090    0.174    ±0.122 
Credit    0.002    0.112    ±0.044 



Three kinds of neighborhood boundaries for the different databases are shown 
in Table 12. These are the minimal, the maximal and the average neighborhood 
boundaries over all attributes, respectively. The precision of $n(a)$ can be 
adjusted through the A setting, and we set A to 0.01 in our experiments. 

Step 3. We produce test costs, which are always represented by positive 
integers. For any $a \in C$, $c(a)$ is set to a random number in [1, 10] subject to 
the uniform distribution. 

Step 4. The misclassification costs are always represented by non-negative 
integers. We produce the matrix of misclassification costs $mc$ as follows: 

1. $mc[m, m] = 0$. 

2. $mc[m, n]$ and $mc[n, m]$ are each set to a random number in [100, 1000]. 
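
Steps 1, 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the generator used for the experiments: min-max scaling is assumed for the normalization in Step 1, the off-diagonal misclassification costs are assumed to be drawn uniformly, and all function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def normalize(column: Sequence[float]) -> List[float]:
    """Min-max normalize one attribute to [0, 1] (assumed form of Step 1)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def random_test_costs(attributes: Sequence[str], rng: random.Random) -> Dict[str, int]:
    """Draw a uniform integer test cost in [1, 10] for each attribute (Step 3)."""
    return {a: rng.randint(1, 10) for a in attributes}


def random_misclassification_costs(n_classes: int, rng: random.Random) -> List[List[int]]:
    """Build mc with a zero diagonal and off-diagonal integers in [100, 1000] (Step 4)."""
    return [
        [0 if i == j else rng.randint(100, 1000) for j in range(n_classes)]
        for i in range(n_classes)
    ]
```

Passing an explicit `random.Random` instance makes each of the 100 cost settings reproducible from its seed.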

5.2. Efficiencies of the two algorithms 

First, we study the efficiency of the backtracking algorithm. Specifically, 
experiments are undertaken with 100 different test cost settings. The search 
space and the number of steps for the backtracking algorithm are listed in Table 
13. From the results we note that the pruning techniques significantly reduce 
the search space. Therefore the pruning techniques are very effective. 

Table 13: Number of steps for the backtracking algorithm. 

Data set  Search space  Minimal steps  Maximal steps  Average steps 
Liver     2^6           8              34             21.27 
Wdbc      2^30          18             113            54.95 
Wpbc      2^33          10             76             44.34 
Diab      2^8           28             102            58.50 
Iono      2^34          107            2814           663.41 
Credit    2^15          105            2029           618.14 
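
To make the size of this reduction concrete, the following short sketch divides the average number of steps in Table 13 by the size of the search space $2^{|C|}$, with $|C|$ taken from Table 11. The dictionary and function names are illustrative only.

```python
# Data set -> (|C| from Table 11, average steps from Table 13).
TABLE_13 = {
    "Liver": (6, 21.27),
    "Wdbc": (30, 54.95),
    "Wpbc": (33, 44.34),
    "Diab": (8, 58.50),
    "Iono": (34, 663.41),
    "Credit": (15, 618.14),
}


def visited_fraction(n_features: int, steps: float) -> float:
    """Fraction of the 2^|C| candidate subsets examined on average."""
    return steps / 2 ** n_features


for name, (n_features, steps) in TABLE_13.items():
    fraction = visited_fraction(n_features, steps)
    print(f"{name:6s}  2^{n_features:<2d}  {fraction:.3e}")
```

For Liver, about a third of the 64 subsets are visited on average, whereas for Wdbc the visited fraction is below $10^{-7}$.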

Second, from Table 13 we note that the number of steps does not simply 
depend on the size of the data set. Wpbc has a much larger search space than 
Credit ($2^{33}$ versus $2^{15}$); however, its number of steps is smaller. 
For some medium-sized data sets, the backtracking algorithm is an effective 
method to obtain the optimal feature subset. 




Figure 3: Run time comparison: (a) maximal time, (b) average time. 

Third, we compare the efficiency of the heuristic algorithm and the 
backtracking algorithm. Specifically, experiments are undertaken with 100 
different test cost settings on the six data sets listed in Table 11. For the 
heuristic algorithm, A is set to 1. The average and maximal run-times of both 
algorithms are shown in Figure 3, where run-time is measured in milliseconds. 
From the results we note that the heuristic algorithm is more stable in terms 
of run-time. 

In short, when run-time is not a concern, the backtracking algorithm 
is an effective method for many data sets. In real applications, when the 
run-time of the backtracking algorithm is unacceptable, the heuristic algorithm 
must be employed. 

5.3. Effectiveness of the heuristic algorithm 

We let $\delta = 1, 2, \ldots, 9$. The precision of $n(a)$ can be adjusted 
through the A setting, and we set A to 0.01 on all data sets except Wdbc and 
Wpbc. Setting A = 0.01 yields very small neighborhoods for the Wdbc and Wpbc 
data sets; hence, we set A = 0.05 for these two data sets. As mentioned earlier, 
the parameter A plays an important role. The data of our experiments come from 
real applications, and the errors are not given by the data sets. In this paper, 
we consider only some possible error ranges. 

The algorithm runs 100 times with different test cost settings and different 
$\delta$ settings on all data sets. Figure 4 shows the resulting finding optimal 
factors. From the results we know that the test cost plays a key role in this 
heuristic algorithm. As shown in Figure 4, the performance of the algorithm is 
completely different for different settings of $\delta$. Data for $\delta = 0$ 
are not included in the experimental results because the respective results are 
incomparable to the others. Figure 5 shows the average exceeding factors. These 
display the overall performance of the algorithm from a statistical perspective. 

Figure 4: Finding optimal factor. 

Figure 5: Average exceeding factor. 



From the results we observe the following: 

1. The quality of the results depends on the data set, because the error 
range and the heuristic information are both computed from the values of 
the data set. 

2. The finding optimal factor is acceptable on most data sets except Wdbc. 
Better results can be obtained with a smaller A; however, the number of 
selected features will then be smaller. 

3. The average exceeding factor is less than 0.08 in most cases. In other 
words, the results are acceptable. 
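
The two measures reported in Figures 4 and 5 can be computed as in the sketch below. It assumes the usual definitions, which are not restated in this section: the finding optimal factor is the fraction of runs in which the heuristic result has the same cost as the optimal one, and the exceeding factor of a run is the relative excess of the heuristic cost over the optimal cost. The function names, the tolerance `tol` and the requirement that optimal costs be positive are assumptions of this illustration.

```python
from typing import Sequence


def finding_optimal_factor(
    heuristic_costs: Sequence[float],
    optimal_costs: Sequence[float],
    tol: float = 1e-9,
) -> float:
    """Fraction of runs in which the heuristic cost matches the optimal cost."""
    hits = sum(
        1 for h, o in zip(heuristic_costs, optimal_costs) if abs(h - o) <= tol
    )
    return hits / len(optimal_costs)


def average_exceeding_factor(
    heuristic_costs: Sequence[float],
    optimal_costs: Sequence[float],
) -> float:
    """Mean of (heuristic - optimal) / optimal over all runs (optimal > 0 assumed)."""
    ratios = [(h - o) / o for h, o in zip(heuristic_costs, optimal_costs)]
    return sum(ratios) / len(ratios)
```

In the experiments above, the optimal costs would come from the backtracking algorithm and the heuristic costs from Algorithm 2, one pair per test cost setting.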



5.4. The results for different cost settings 

In this section, we study the changes of the minimal total cost for different 
misclassification cost settings. Table 14 lists the optimal feature subsets for 
different misclassification costs on the Wdbc data set. The ratio of the two 
misclassification costs is set to 10 in this experiment. 



Table 14: The optimal feature subset based on different misclassification costs. 

MisCost1  MisCost2  Test costs  Total cost  Feature subset 
50        500       3.00        3.70        [1, 3, 27] 
100       1000      4.00        4.35        [1, 3, 15, 29] 
150       1500      4.00        4.53        [1, 3, 15, 29] 
200       2000      4.00        4.70        [1, 3, 15, 29] 
250       2500      4.00        4.88        [1, 3, 15, 29] 
300       3000      5.00        5.00        [1, 12, 15, 27] 



As shown in this table, when the misclassification costs are low, the algorithm 
avoids undertaking expensive tests. 

When the misclassification cost is very large compared with the test cost, the 
FSMC problem coincides with the MTR problem. Therefore the FSMC problem is 
a generalization of the MTR problem. 

In the last row of Table 14, the test cost of the subset [1, 12, 15, 27] equals 
the total cost; therefore the misclassification cost is 0 and this feature subset 
is a reduct. 
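
The same reasoning applies to every row of Table 14: since the total cost is the test cost plus the average misclassification cost, the latter can be recovered by subtraction. The short sketch below does this for all six rows; the variable and function names are illustrative only.

```python
# (MisCost1, MisCost2, test cost, total cost) for each row of Table 14.
TABLE_14 = [
    (50, 500, 3.00, 3.70),
    (100, 1000, 4.00, 4.35),
    (150, 1500, 4.00, 4.53),
    (200, 2000, 4.00, 4.70),
    (250, 2500, 4.00, 4.88),
    (300, 3000, 5.00, 5.00),
]


def average_misclassification_cost(test_cost: float, total_cost: float) -> float:
    """Total cost minus test cost, rounded to the table's two decimals."""
    return round(total_cost - test_cost, 2)


residuals = [
    average_misclassification_cost(test, total) for _, _, test, total in TABLE_14
]
for (mis1, mis2, _, _), residual in zip(TABLE_14, residuals):
    print(f"MisCost1={mis1:4d}  MisCost2={mis2:5d}  misclassification part={residual:.2f}")
```

Only the last row yields zero, which is the row where the selected subset is a reduct.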

The changes of test costs vs. the average minimal total cost are also shown in 
Figure 6. In the real world, we would not select expensive tests when 
misclassification costs are low. Figure 6 shows this situation clearly. From the 
results we observe the following: 

1. As shown in Figures 6(a), 6(b), 6(e) and 6(f), when the test costs remain 
unchanged, the total costs increase linearly with the misclassification 
costs. 

2. If the misclassification costs are small enough, we may give up the test. 
Figure 6(d) shows that when the misclassification costs are $30 and $300, 
the test cost is zero and the total cost is the most expensive. 

3. As shown in Figures 6(a) and 6(c), the total costs increase along with the 
increasing misclassification costs. The total costs stop changing once 
they equal the test costs. 



6. Conclusion and further works 

In this paper, we built a new covering-based rough set model considering 
normal distribution measurement errors. Furthermore, based on this new model, 
a feature selection problem that minimizes the total cost is defined. This new 
feature selection problem has a wide application area for two reasons. One 
reason is that the resources one can afford are often limited. The other reason 
is that data with measurement errors are ubiquitous. In order to obtain the 
optimal result, a backtracking algorithm and a heuristic algorithm are designed 
for the FSMC problem. Experimental results indicate the efficiency of the 
backtracking algorithm and the effectiveness of the $\delta$-weighted heuristic 
algorithm. 

In the future, much work needs to be undertaken. From the standpoint of the 
data model, new data models addressing the learning of neighborhood boundaries 
can be built. These models are more complex than the one presented in this work. 
The experimental results could also be evaluated by classification, which could 
be meaningful in real applications. From the standpoint of the algorithm, other 
exhaustive algorithms (see, e.g., [31]) and entropy-based heuristic algorithms 
(see, e.g., [1DJ[TS]) should be developed. In summary, this study suggests new 
research trends concerning covering-based rough set theory, the feature selection 
problem and cost-sensitive learning. 